Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition.
Abstract
In molecular biology, quantifying the similarity between two biological sequences is an important problem. Past research has shown that word-based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biometrics 53, 1431-1439) characterized a family of word-based dissimilarity measures that define the distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Specifically, they introduced the use of Mahalanobis distance and standardized Euclidean distance into the study of DNA sequence dissimilarity. They showed that both distances had better sensitivity and selectivity than the commonly used Euclidean distance. The purpose of this article is to extend Mahalanobis and standardized Euclidean distances to Markov chain models of base composition. In addition, a new dissimilarity measure based on the Kullback-Leibler discrepancy between the frequencies of all n-words in the two sequences is introduced. Applications to real data demonstrate that Kullback-Leibler discrepancy performs better than Euclidean distance. Moreover, under a Markov chain model of order kQ for base composition, where kQ is the order estimated from the query sequence, standardized Euclidean distance performs very well: it performs as well as Mahalanobis distance and better than Kullback-Leibler discrepancy and Euclidean distance. Since standardized Euclidean distance is drastically faster to compute than Mahalanobis distance, its use under the Markov chain model of order kQ is generally recommended in a usual workstation/PC computing environment. However, if the user is very concerned with computational efficiency, then the use of Kullback-Leibler discrepancy, which can be computed as fast as Euclidean distance, is recommended. This can significantly enhance the current technology for comparing large datasets of DNA sequences.
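The two measures the abstract describes as cheapest to compute can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption rather than the authors' definitions: the function names are hypothetical, n-words are counted with overlap, the Euclidean measure is taken as a sum of squared frequency differences, and the Kullback-Leibler discrepancy is smoothed with a small pseudocount to avoid taking the logarithm of zero. The Markov chain standardization behind the standardized Euclidean and Mahalanobis distances is not reproduced.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"  # assumption: sequences contain only these four letters


def word_frequencies(seq, n):
    """Relative frequencies of all overlapping n-words in seq."""
    total = len(seq) - n + 1
    if total <= 0:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + n] for i in range(total))
    words = ("".join(w) for w in product(ALPHABET, repeat=n))
    return {w: counts.get(w, 0) / total for w in words}


def euclidean_distance(f, g):
    """Sum of squared differences between two n-word frequency vectors."""
    return sum((f[w] - g[w]) ** 2 for w in f)


def kl_discrepancy(f, g, pseudo=1e-6):
    """Kullback-Leibler discrepancy of f from g.

    The pseudocount smoothing is an illustrative choice, not taken
    from the article.
    """
    norm = 1.0 + pseudo * len(f)
    total = 0.0
    for w in f:
        p = (f[w] + pseudo) / norm
        q = (g[w] + pseudo) / norm
        total += p * math.log(p / q)
    return total


query = word_frequencies("ACGTACGTAC", 2)
subject = word_frequencies("AAAACCCCGG", 2)
print(euclidean_distance(query, subject))
print(kl_discrepancy(query, subject))
```

Both functions make a single pass over the 4^n word frequencies, which is consistent with the abstract's remark that Kullback-Leibler discrepancy can be computed as fast as Euclidean distance.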
Similar resources
Evaluation of First and Second Markov Chains Sensitivity and Specificity as Statistical Approach for Prediction of Sequences of Genes in Virus Double Strand DNA Genomes
The growing amount of information on biological sequences has made the application of statistical approaches necessary for modeling and estimating their functions. In this paper, the sensitivity and specificity of first- and second-order Markov chains for gene prediction were evaluated using complete double-stranded DNA virus genomes. There were two approaches for prediction of each Markov Model parameter,...
A probabilistic measure for alignment-free sequence comparison
MOTIVATION Alignment-free sequence comparison methods are still in the early stages of development compared to those of alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models. RESULT...
Lecture 6 : CRFs for Computational Gene Prediction
One of the fundamental problems in computational biology is to identify genes in very long genome sequences. As we know DNA is a sequence of nucleotide molecules (a.k.a. bases) which encode instructions for generation of proteins. However not all of these bases are responsible for protein generation. As an example shown in the 4th slide on page 1 of [1], in the eukaryotic gene structure, only e...
An Analysis of Continuous Time Markov Chains using Generator Matrices
This paper mainly analyzes the applications of the Generator matrices in a Continuous Time Markov Chain (CTMC). Hidden Markov models [HMMs] together with related probabilistic models such as Stochastic Context-Free Grammars [SCFGs] are the basis of many algorithms for the analysis of biological sequences. Combined with the continuous-time Markov chain theory of likelihood based phylogeny, stoch...
Lecture 18 : CRFs for Computational Gene Prediction
One of the fundamental problems in computational biology is to identify genes in very long genome sequences. As we know, DNA is a sequence of nucleotide molecules (a.k.a. bases) which encode instructions for generation of proteins. However, not all of these bases are responsible for protein generation. As an example shown in the 4th slide on page 1 of [2], in the eukaryotic gene structure, only...
Journal title:
- Biometrics
Volume 57, Issue 2
Pages -
Publication date: 2001